ABSTRACT 

This report documents results of a two-year effort 
toward the study and investigation of the design of a prototype 
system for Chinese-English machine translation in the general area of 
physics. Previous work in Chinese-English machine translation is 
reviewed. Grammatical considerations in machine translation are 
discussed and detailed aspects of the Berkeley Grammar II and 
Abstract 



This report documents results of a two-year effort 
toward the study and investigation of the design of a prototype 
system for Chinese-English machine translation in the general 
area of physics. Past research efforts at Berkeley had been 
centered on three areas: (1) contrastive study of Chinese and 
English, (2) development of an automatic dictionary and 
(3) programming support for machine implementation. 

Our work on grammar (Berkeley Grammar II) in the past 
two years has focused on the expansion and consolidation of our 
syntactic rules. Grammar codes in the dictionary and in the rules 
are reviewed for consistency, and redundancies are eliminated. 
Further sets of rules are added as a result of the continuing 
testing and revision of our grammar based on texts in nuclear 
physics and also on previously existing texts in biochemistry. 

The statistics on our last run show that our Syntactic Analysis 
System (SAS) is able to recognize and parse satisfactorily 
strings consisting of 20-25 Chinese characters, indicating that 
the SAS can consistently parse 90% of such sentences. 

Testing and implementation of interlingual transfer 
rules have concentrated on the conversion of Chinese nominalizations 
and relativizations to their English counterparts by implementing 
binary permutations and substitution of lexical Chinese-English 
data. Work is continuing on the implementation of English 
complementizers. 



Approximately 15,000 Chinese lexical entries, mostly in 
the area of nuclear physics, have been compiled, coded into 
standard telegraphic code and added to our dictionary, which now 
totals approximately 57,000 lexical entries. These entries have 
been assigned grammar codes, romanization and English gloss. 



Programs for the Syntactic Analysis System were written 
in CDC FORTRAN IV and COMPASS and run entirely on the CDC 6400. 
In addition to the continual refinement of the managerial, 
adaptation and parsing programs of the SAS, we have implemented 
plotting routines on the Calcomp plotter to output structural trees, 
as well as plotting the corresponding sentences in Chinese 
characters. 



Documentation for all completed programs of the SAS is 
included in this report. 




I. Introduction 



The project on machine translation (MT) of Chinese to English 
at Berkeley has been in continuous existence for ten years. 
Research during this period may be roughly considered to have 
progressed in three stages. Phase One concentrated on 
lexicographical studies. Major research completed during this period 
(1960 to 1967) has resulted in the compilation of a dictionary (with 
subject matter mainly in the area of biochemistry). Phase Two 
initiated syntactic studies near the end of this period with the 
creation of a grammar for the automatic analysis of Chinese 
sentence structure. 



Phase Three is the integration of the previous two phases 
leading to interlingual studies concomitant with lexicographical 
and syntactic research. The specific subject matter is now in 
the area of nuclear physics. In the Final Technical Report on 
work accomplished in the contractual period immediately prior to 
the present one (1967 to 1968), Version I of the Syntactic 
Analysis System (SAS) was documented. SAS is a package of computer 
programs capable of accepting coded Chinese sentences as input 
and producing syntactic trees representing the resulting analysis 
of the Chinese sentences. Limited interlingual mechanisms were 
also implemented in these structures. 



The present report documents the results of further work 
in this area during the period of September 1968 through 
August 1970. 




II. The Background in Chinese-English MT 



In order to provide a certain perspective regarding the 
work under report, a brief survey of previous work in 
Chinese-English MT is presented below. There is no denying the fact 
that the present work has stood on the foundations of earlier efforts 
in the field and inherited their successes and problems. At the same 
time it has also uncovered further problems as well as some new methods 
of solution consonant with the state of the art. 



II.1. Georgetown 

The work at Georgetown in the late 1950's was not focused 
principally on Chinese but was part of a general assault on MT. 
As was to be expected, initial work concentrated on problems of 
Chinese text input. Standard telegraphic code was used as the 
most natural for computer input. The translation process was 
largely dependent on the result of lexicon look-up. Only a very 
small sample was used. Sentences had to be processed 
independently by sophisticated "trial-and-error" procedures, since no 
appropriate "grammar" was available for this purpose. 






II.2. University of Washington, Bunker-Ramo, University of Texas 



The work done at the University of Washington under 
Professor Erwin Reifler was in the spirit of comparative Chinese 
structure and lexicography. Again, 
the emphasis had to be on the compilation of an adequate glossary 
of Chinese. Reifler, early in his report [Final Report to the 
NSF, 1962], pointed out the differences in style, constructions 
and allowable forms of scientific publications and that no 
scholarly descriptions of the Chinese language had yet dealt with 
this aspect of the language. The corpus was not restricted to 
one field or subfield of knowledge but rather to Chinese 
scientific texts in general. The choice of such an approach to the 
corpus was apparently dictated not only by the difficulty of 
obtaining sufficient material from one subject field at that time 
but also by the hope of gaining "a more representative picture of 
the general-language problems of the language of science and 
technology" [NSF Report, pp. 12-13]. As will be seen in our 
report below, this approach is still premature to a certain extent 
and it would have been better to restrict our goal to one field. 
The effort was concentrated on the study of a refined glossary 
which would give better word-for-word translations into English. 
The study produced a glossary of 1880 terms which are of some 
linguistic value. 




The work at Bunker-Ramo was an application of the Fulcrum 
technique in cooperation with Berkeley and the University of Texas. 
Their work was the development of the interlingual mapping and English 
generation phases of Chinese-English MT by adapting the Berkeley 
parser, representing early efforts of Phase Three 
of the Berkeley MT system. It should be noted that the partial 
system implemented by Bunker-Ramo used the SNOBOL3 language, 
whereas FORTRAN was the mainstay of the Berkeley system, and the 
two were therefore not directly interfaceable. Texas took on the 
responsibility of expanding and supplementing the Berkeley dictionary. 
Since 1968 the Berkeley project has also assumed the task of 
implementing these final phases as well as the updating of the dictionary. 



II.3. Peking 



By 1958 the Linguistic Research Institute of the Chinese 
Academy of Sciences in Peking had already established a system 
for the translation of Russian to Chinese. Although the present 
survey is concerned with work done in the mechanical translation 
of Chinese as the source language, whereas the work in Peking has 
Chinese as the target language, the fact that mechanization 
involving Chinese has been undertaken should not be ignored. Their 
initial investigations took the approach of analysing both languages 
independently and then attempting translation based on the results 
of such analysis. They later (1961) changed their approach to 
that of emphasizing the contrastive analysis aspects even during 
initial analysis. It should be noted that since their work was 
Russian to Chinese, many more surface morphological aspects of 
the source language were available as compared to translation of 
Chinese to English, in which grammatical categories such as 
plurality, person, noun-verb distinctions, and tense-aspect-mood 
systems are not well marked by surface morphology and cause 
many difficulties in the initial phase of contrastive analysis. 
III. Grammatical Considerations in MT 



III.1. Linguistic Analysis and MT 



Research into experimental MT must be based on a well 
defined framework of linguistic analysis. The results that have 
been achieved thus far and the results that can be expected are 
predictably circumscribed by the particular framework chosen. 
Moreover, such results are also evidence of the range of limita- 
tions and usefulness inherent in the particular theoretical 
framework. There are important differences between the goals of 
research in linguistic theory and the goals of research in exper- 
imental MT. Although theoretical research is concerned with the 
totality of linguistic competence, actual instances of such 
research activities usually focus on particular aspects of this 
totality. The general approach is that of deduction. Thus, for 
example, a proposed explanation for complementation in a language 
in general does not exhaustively take into consideration all 
of the verbs in the language. Furthermore, in practice, the 
rules proposed are never exhaustively crosschecked against others 
in the language. On the other hand, research into experimental 
MT and other activities in computational linguistics must be 
constantly concerned with the total range of exhaustive 
application of the results of the more theory-oriented research. 
Thus inadequacies in the theory-oriented descriptions are 
frequently and constantly unearthed by the more exhaustive concerns 
of computational linguistics and MT. This underlines the fact 
that research in MT, not only in theory but also in practice, is 
concerned with the totality of the descriptive adequacy of the 
grammar. 

Fundamental to all linguistic activities and all other 
scientific activities is the concern for the basic units in the 
system. In the case of grammar these units will be the 
grammatical categories. Grammatical concepts, in the sense of Boas 
and Sapir, are manifested in the surface structures of sentences 
in a language, and different languages will have different 
manifestations of such grammatical concepts.¹ Furthermore, 
semantically corresponding sentences in different languages may utilize 
different grammatical concepts and categories, which are realized 
differently. Numerous scholars have attempted to discuss and 
outline the nature and range of grammatical concepts. As an 
example, we may quote Sapir's analysis of "The farmer kills the 
duckling." 

¹ See, for example, Boas, Franz. 1970. Introduction to the 
  Handbook of American Indian Languages. (Ed.) Preston Holder. 
  University of Nebraska Press, Lincoln; Sapir, Edward. 1921. 
  Language: An Introduction to the Study of Speech. Harcourt, 
  Brace, New York; Jespersen, Otto. 1924. The Philosophy of Modern 
  Grammar. New York; Tesnière, Lucien. 1959. Éléments de Syntaxe 
  Structurale. Paris [14]; and Jakobson, Roman. 1959. "Boas' 
  view of grammatical meaning," in American Anthropologist 61:5. 

I. Concrete Concepts: 
   1. First subject of discourse: farmer 
   2. Second subject of discourse: duckling 
   3. Activity: kill 
   analyzable into: 
   A. Radical Concepts: 
      1. Verb: (to) farm 
      2. Noun: duck 
      3. Verb: kill 
   B. Derivational Concepts: 
      1. Agentive: expressed by suffix -er 
      2. Diminutive: expressed by suffix -ling 

II. Relational Concepts: 
   Reference: 
      1. Definiteness of reference to first subject of 
         discourse: expressed by first the, which 
         has preposed position 
      2. Definiteness of reference to second subject 
         of discourse: expressed by second the, which 
         has preposed position 
   Modality: 
      3. Declarative: expressed by sequence of "subject" 
         plus verb; and implied by suffixed -s 
   Personal relations: 
      4. Subjectivity of farmer: expressed by position 
         of farmer before kills; and by suffixed -s 
      5. Objectivity of duckling: expressed by position 
         of duckling after kills 
   Number: 
      6. Singularity of first subject of discourse: 
         expressed by lack of plural suffix in farmer; 
         and by suffix -s in following verb 
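      7. Singularity of second subject of discourse: 
         expressed by lack of plural suffix in duckling 
   Time: 
      8. Present: expressed by lack of preterit suffix 
         in verb; and by suffixed -s 
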
In this short sentence of five words there are expressed, 
therefore, thirteen distinct concepts, of 
which three are radical and concrete, two derivational, 
and eight relational.... 

Our analysis may seem a little belabored, but only 
because we are so accustomed to our own well-worn 
grooves of expression that they have come to be felt 
as inevitable. Yet destructive analysis of the familiar 
is the only method of approach to an understanding 
of fundamentally different modes of expression.... 

A cursory examination of other languages, near and 
far, would soon show that some or all of the thirteen 
concepts that our sentence happens to embody may not 
only be expressed in different form but that they may 
be differently grouped among themselves; that some 
among them may be dispensed with; and that other concepts, 
not considered worth expressing in English idiom, may be 
treated as absolutely indispensable to the intelligible 
rendering of the proposition. 

In the Chinese sentence "Man kill duck," which may 
be looked upon as the practical equivalent of "The 
man kills the duck," there is by no means present 
for the Chinese consciousness that childish, halting, 
empty feeling which we experience in the English 
translation. The three concrete concepts -- two 
objects and an action -- are each directly expressed 
by a monosyllabic word which is at the same time a 
radical element; the two relational concepts -- "subject" 
and "object" -- are expressed solely by the position 
of the concrete words before and after the word of 
action. And that is all. Definiteness or indefiniteness 
of reference, number, personality as an inherent 
aspect of the verb, not to speak of gender -- all 
these are given no expression in the Chinese sentence, 
which, for all that, is a perfectly adequate 
communication -- provided, of course, there is that context, 
that background of mutual understanding that is 
essential to the complete intelligibility of all 
speech. Nor does this qualification impair our 
argument, for in the English sentence too we leave 
unexpressed a large number of ideas which are either 
taken for granted or which have been developed or 
are about to be developed in the course of the 
conversation.... (Language, p. 88ff) 



On the basis of Sapir's analysis one cannot fail to 
deduce that perhaps it would be easier to translate from English 
into Chinese than from Chinese into English. We have already 
mentioned the research activities in MT in Peking. 
It was found that at the level of morphology it is definitely 
simpler going from Russian to Chinese than in the reverse 
direction. The subsequent deduction is that the same holds true for 
English and Chinese MT. This would mean that for the final 
English output of a Chinese-to-English MT system we would have to 
devote some effort toward establishing categories that may be 
absent in the source language. In the present framework of 
discussion it would seem that the mechanics of translation may 
vary. Under the first approach (pairwise approach), at the one 
extreme, it could be a word-to-word translation which is quite 
similar to expressing a message content in the target language 
using the inventory of grammatical concepts of the source 
language. Abundant examples of this could be found in the early 
attempts at experimental machine translation (such as the projects 
at the University of Washington and Georgetown University). 
Under the second approach (many-language approach), at the other 
extreme, it would involve several steps: (1) establishing the 
universal set of grammatical categories (some scholars have 
associated this with an [intermediate] universal language), 
(2) establishing the correspondences between the source language 
and this universal set (or translating into the intermediate 
language -- attempts at translating natural language into first 
order predicate calculus may be seen as one attempt in this 
direction), and (3) establishing the correspondences between the 
intermediate language and the target language. This would ensure 
that no grammatical information is missing and would also 
facilitate n-tuple interlingual translations. We are not aware 
of any serious and persistent attempts with this 
approach in mind beyond the theoretical level. 
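
The three steps of the second approach can be sketched in Python as 
follows. This is only an illustrative sketch and not part of the 
Berkeley system: the category names in UNIVERSAL, the toy analyses, and 
all function names are hypothetical. It is meant to show why routing 
through a universal set keeps every category slot present, even when 
the source language leaves it unmarked. 

```python
# Step 1 (hypothetical): a universal set of grammatical categories.
UNIVERSAL = ("number", "definiteness", "tense", "classifier")


def to_intermediate(source_marks):
    """Step 2: map a source-language analysis onto the universal set.

    Categories the source does not mark overtly are kept as explicit
    None slots instead of being dropped.
    """
    return {cat: source_marks.get(cat) for cat in UNIVERSAL}


def to_target(intermediate, target_categories):
    """Step 3: keep only the categories the target language expresses."""
    return {cat: intermediate[cat] for cat in target_categories}


def unresolved(target_analysis):
    """Categories the target needs but the source left unmarked."""
    return sorted(cat for cat, val in target_analysis.items() if val is None)


# Toy analysis of a Chinese noun phrase: only the classifier is overt.
chinese_marks = {"classifier": "ge"}
# Assumed (hypothetical) inventory of categories required for English.
english_categories = ("number", "definiteness", "tense")

inter = to_intermediate(chinese_marks)
english = to_target(inter, english_categories)
print(unresolved(english))
```

The list printed at the end names the categories that would have to be 
established for the English output because the Chinese input does not 
mark them overtly. 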

We can bring our discussion to a more concrete level 
with reference to the following table: 

Table 1: "Overt" grammatical categories in the surface 
structures of source and target languages 
[table body not recoverable from the source] 









Under the first approach, only L_S and L_T1 are considered, 
and all features marked for L_S (i.e., B, C, and E) are marked 
for L_T1 regardless of whether they are utilized in L_T1 and 
regardless of others which may be called into play in L_T1. In 
the second approach, more than one target language may be 
considered and all features are marked regardless of their utility 
in the languages concerned. For example, even though G may not 
be called for in any of the languages concerned, yet it is marked 
for all the languages concerned. 
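
The difference between the two marking schemes can be illustrated with 
a small Python sketch. The feature inventory A through G and the 
features assigned to the target language below are assumptions made 
purely for illustration; only the source-language features B, C, and E 
and the otherwise unused feature G are taken from the text. 

```python
# Assumed full inventory of grammatical features.
ALL_FEATURES = set("ABCDEFG")

# From the text: features overtly marked in the source language.
SOURCE_FEATURES = {"B", "C", "E"}
# Hypothetical: features the target language actually utilizes.
TARGET_FEATURES = {"A", "B", "E"}


def mark_pairwise(source_features):
    """First approach: the target is marked with exactly the source's features."""
    return set(source_features)


def mark_many_language(all_features):
    """Second approach: every language is marked for every feature."""
    return set(all_features)


pairwise = mark_pairwise(SOURCE_FEATURES)
# Features marked for the target although the target does not utilize them.
unused = sorted(pairwise - TARGET_FEATURES)
# Features the target calls into play that were never marked.
missing = sorted(TARGET_FEATURES - pairwise)

universal = mark_many_language(ALL_FEATURES)
print(unused, missing, "G" in universal)
```

Under these assumed inventories the pairwise scheme carries one feature 
the target never uses and omits one it needs, while the many-language 
scheme marks everything, including G. 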

In an approach to MT that is guided by a practical concern 
for immediate results, it is necessary to pursue an intermediate 
course by attempting to analyze L_S in the parsing program 
with a set of grammatical features (codes) that is the union 
of the set of grammatical features in Chinese and English. For 
example, plurality, which is generally not overtly expressed in 
Chinese, is recognized, while nominal classifiers, which are 
generally not overtly expressed for non-mass nouns in English, 
are also recognized. Consider another example based on the 
distinction between partitive and non-partitive genitive² in the 
two languages under study. The underlying structure for partitive 
genitive is vastly different from that for non-partitive genitive. 
For example, in the following sentences 



² This is sometimes known as subjective genitive and objective 
  genitive in the discussion of Sanskrit and other languages. 
  Subjective genitive is where the noun in the genitive case is 
  the subject of the underlying sentence, whereas in objective 
  genitive the object of the underlying sentence becomes the noun 
  in the genitive case. 